Authors


Hector R. Gavilanes Chief Information Officer
Gail Han Chief Operating Officer
Michael T. Mezzano Chief Technology Officer


University of West Florida

November 2023

What is PCA?

  • Principal Component Analysis.
  • Dimensionality reduction technique.
  • Purpose:
    • Simplification of complex datasets.
    • Preservation of important information.

Why Use PCA?

  • Reducing Dimensionality: Simplify high-dimensional data.
  • Visualizing Data: Help visualize data in lower dimensions.
  • Noise Reduction: Eliminate less relevant features.
  • Improved Model Performance: Enhance machine learning efficiency.

Methods

  • Data matrix \(X\) of size \(N\) x \(P\).
  • Variables are linearly related.
  • Continuous and approximately normally distributed data.
    • In practice, normality is not strictly required, and PCA is often applied without it.
  • Variables are similar in scale and without extreme outliers.
  • Missing data: Imputation or removal of observations.
  • Centering and scaling: Transform variables to a mean of 0 and a standard deviation of 1. \[ z_{np} = \frac{x_{np} - \bar{x}_{p}}{{\sigma_{p}}} \]
  • Covariance: A measure of how two random variables vary together. \[ Cov(x,y) = \frac{\Sigma(x_i-\bar{x})(y_i-\bar{y})}{N} \]
  • Covariance Matrix: Symmetric \(p \times p\) matrix which gives the covariance values for each pair of variables in the dataset.
  • Eigenvector: a nonzero vector whose direction is unaffected by a linear transformation.
  • An eigenvector is scaled by factor \(\lambda\), the eigenvalue.
  • Each principal component is given by the eigenvectors of the covariance matrix.
    • The eigenvectors represent the directions of the new principal axes.
    • The eigenvalues represent the variance captured along each principal axis.
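The steps above can be sketched in NumPy; this is a minimal illustration on synthetic data (the presentation's own analysis uses R, and the array sizes and names here are assumptions):

```python
import numpy as np

# Synthetic stand-in data: N = 100 observations of P = 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Center and scale: z = (x - mean) / sd, giving mean 0 and sd 1 per variable
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Symmetric P x P covariance matrix (bias=True matches the N denominator above)
S = np.cov(Z, rowvar=False, bias=True)

# Eigenvectors give the directions of the principal axes;
# eigenvalues give the variance along each axis
eigenvalues, eigenvectors = np.linalg.eigh(S)
```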

Finding the Principal Components

  • Find the linear combination of the columns of \(X\) (the variables) which maximizes variance.
  • Let \(a\) be a vector of constants \(a_1, a_2, a_3, …, a_p\) such that \(Xa\) represents the linear combination which maximizes variance.
  • The variance of \(Xa\) is represented by \(var(Xa) = a^TSa\) with the covariance matrix \(S\).
  • Finding the \(Xa\) with maximum variance equates to finding the vector \(a\) which maximizes the quadratic \(a^TSa\), where \(a^Ta = 1\).
  • \(a\) is a unit-norm eigenvector with eigenvalue \(\lambda\) of the covariance matrix \(S\).
  • The largest eigenvalue of \(S\) is \(\lambda_1\), with eigenvector \(a_1\). For any unit-norm eigenvector \(a\): \[ var(Xa) = a^TSa = \lambda a^Ta = \lambda \]
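The identity \(var(Xa) = a^TSa = \lambda\) can be verified numerically; a small sketch on synthetic data (sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)              # centered data matrix
S = np.cov(Xc, rowvar=False)         # covariance matrix S

# eigh returns eigenvalues in ascending order; take the largest pair
eigenvalues, eigenvectors = np.linalg.eigh(S)
a1 = eigenvectors[:, -1]             # unit-norm eigenvector a_1
lam1 = eigenvalues[-1]               # largest eigenvalue lambda_1

# var(Xa) = a^T S a = lambda a^T a = lambda, since a^T a = 1
var_Xa1 = a1 @ S @ a1
```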

Principal Components

  • Impose the restriction of orthogonality on the coefficient vectors \(a_k\).
    • Ensure the principal components are uncorrelated.
  • The eigenvectors of \(S\) represent the solutions to finding \(Xa_k\) which maximize variance while minimizing correlation with prior linear combinations.
  • Each \(Xa_k\) is a principal component of the dataset, with eigenvector \(a_k\) and eigenvalue \(\lambda_k\).
  • The elements of \(Xa_k\) are the factor scores of the PCs: how each observation scores on a PC.
  • In a geometric interpretation of PCA, a factor score is a length (magnitude) on the Cartesian plane: the projection of the original observation onto the PC from the origin at \((0, 0)\).
  • The elements of the eigenvectors \(a_k\) represent the loadings of the PCs: the weights of the original variables in the computation of the PCs.
  • A loading can also be read as the correlation, from -1 to 1, of each variable with the factor scores.
  • Eigenvectors: Represent directions of maximum variance.
  • Eigenvalues: Indicate the variance explained by each eigenvector.
  • Sorting: Sort eigenvalues in descending order to select the most significant principal components.
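Putting these pieces together, factor scores, loadings, and the descending eigenvalue sort can be sketched in NumPy on synthetic data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized data
S = np.cov(Z, rowvar=False)

# Sort eigenvalues in descending order to rank the PCs
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

scores = Z @ eigenvectors      # factor scores Xa_k: one column per PC
loadings = eigenvectors        # loadings: weights of the original variables

# The PCs are uncorrelated: the covariance of the scores is diagonal
C = np.cov(scores, rowvar=False)
off_diag = C - np.diag(np.diag(C))
```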

Example

  • For this example of PCA, the Abalone dataset from the UCI Machine Learning Repository is used.
  • This dataset contains 4177 observations of 9 variables which record characteristics of each abalone, including sex, length, diameter, height, weights, and the number of rings.
  • The variables, apart from sex, are continuous and correlated.

Preprocessing the data

  • Exclude non-numeric variables from the dataset.
    • The variable Sex is excluded.
  • Check for missing data.
    • No missing data in the dataset.
  • Scale and center the data.
  • Check for and handle extreme outliers.
    • Outliers do not present a large problem.
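A generic sketch of these preprocessing steps in NumPy; the values below are hypothetical stand-ins, not rows from the Abalone dataset (which, as noted, has no missing values):

```python
import numpy as np

# Hypothetical numeric columns after excluding the Sex variable
data = np.array([[0.455, 0.365, 15.0],
                 [0.350, np.nan, 7.0],
                 [0.530, 0.420, 9.0]])

# Check for missing data
n_missing = int(np.isnan(data).sum())

# If any values are missing, impute with the column mean
# (removing the affected observations is the alternative)
col_means = np.nanmean(data, axis=0)
data = np.where(np.isnan(data), col_means, data)

# Scale and center each variable
Z = (data - data.mean(axis=0)) / data.std(axis=0)
```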

Perform Principal Component Analysis

The prcomp() function performs principal component analysis by a singular value decomposition of the centered (and optionally scaled) data matrix, which is numerically more stable than an eigendecomposition of the covariance matrix.

  • The standard deviation for each PC represents the information captured by that principal component.
  • The proportion of variance is the percent of total variance captured by each PC.
  • The cumulative proportion gives the total variance captured by the PC and all prior PCs.
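The quantities in prcomp()'s summary can be reproduced from an SVD; a NumPy sketch on synthetic data (Python shown only for illustration, since the analysis itself uses R):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the standardized data matrix, as prcomp() uses internally
U, d, Vt = np.linalg.svd(Z, full_matrices=False)

sdev = d / np.sqrt(Z.shape[0] - 1)        # standard deviation of each PC
prop_var = sdev**2 / np.sum(sdev**2)      # proportion of variance per PC
cum_prop = np.cumsum(prop_var)            # cumulative proportion
```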

Visualizing the results

Interpreting the results

  • The loadings of the first two principal components show the contribution of each variable to PC1 and PC2.

Variance Explained

  • Explained Variance Ratio: Divide each eigenvalue by the sum of all eigenvalues to obtain the proportion of variance it explains.
  • Cumulative Variance: Plot cumulative explained variance to determine components to retain.

Objective

  • Weighted Combination
  • Maximal Variance Components

High Variance vs. Low Variance

Dimensionality Reduction

  • Unsupervised Learning.
  • Reduce Dimensions: Transform data by multiplying with selected eigenvectors.
  • New Feature Space: Data exists in a lower-dimensional feature space.
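Reducing dimensions by multiplying the data with the selected eigenvectors can be sketched in NumPy (sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

S = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]

# Keep the top-2 eigenvectors and project into the new 2-D feature space
W = eigenvectors[:, order[:2]]
X_reduced = Z @ W
```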

Visualization

  • Data Projection: Visualize data in the reduced feature space.
  • Scatterplots: Use scatterplots to visualize data distribution.

Assumptions and Limitations

  • Interpretability: Loss of interpretability in transformed features.
  • Loss of Information: Reducing dimensionality may result in some information loss.
  • Scaling: Data scaling is important to avoid feature dominance.

Applications of PCA

  • Image Compression: Reduce image size while preserving details.
  • Face Recognition: Reduce facial feature dimensions for classification.
  • Anomaly Detection: Identify anomalies in large datasets.
  • Bioinformatics: Analyze gene expression data.

Dataset

  • 39 variables, or features.
  • 56 observations.
  • State-level averages.
  • “State” represents 50 states & 6 U.S. territories.
  • Survey administered to in-center hemodialysis patients.
  • Dialysis Quality measures.
  • 24 features: patient care quality ratings, transfusions, fistula usage, infections, hospitalizations, incident waitlisting & readmissions.
  • 14 features: dialysis adequacy (Kt/V), type of dialysis, serum phosphorus level, average hemoglobin level, normalized protein catabolic rate (nPCR), hypercalcemia level.

Dataset Summary

  • PCA is designed for continuous numerical data.

  • Categorical index feature removed from model.

Dataset Selection Rationale

  • Driven by multicollinearity among the features.

  • Some features are less significant in explaining variability.

  • All variables are numeric.

  • One categorical index variable (“State”).

Data Preparation

  • Efficient removal of white spaces in the dataset.

  • Editing variable names to enhance readability and meaning.

Original: “Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL.”

Edited: “hypercalcemia_calcium > 10.2Mg.”

Missing Values

  • 34 missing values.

  • Imputation of missing values using the mean (\(\mu\)).

Distribution

  • Normality is not assumed.

QQ-Plot of Residuals

  • Outliers are present throughout the entire dataset.

Standardization

  • Mean (\(\mu\)=0); Standard Deviation (\(\sigma\)= 1)

    \[ Z = \frac{{ x - \mu }}{{ \sigma }} \]

    \[ Z \sim N(0,1) \]

Outliers & Leverage

  • 3 Outliers

  • No leverage

  • Minimal difference.

  • No observations removed.

Results

  • Principal component analysis was performed using a singular value decomposition approach.
  • PC1 captures 40.80% of the variance in the data.
  • PC1 and PC2 together capture 50.27% of the variance.
  • The first four PCs capture 67.66% of the variance, or just over two-thirds.
  • After the fourth PC, the variance captured by each successive PC begins to diminish relative to PCs one through four.
  • The first ten PCs capture 88.67% of the variance.
  • Over 90% of the information in the dataset can be explained by the first eleven PCs.
  • The variables which contribute the most to PC1 are
    • expected_hospital_readmission
    • expected_transfusion
    • expected_hospitalization
  • PC2, which is orthogonal to PC1, has relatively large contributions from the five variables measuring levels of phosphorus.
  • Principal component regression was performed with expected_survival used as the response variable.
  • The estimates and significance of each PC regressor demonstrate the differences between variance captured from the data and usefulness in a linear model.
    • For example, PC4 is a significant regressor despite capturing less variance than PC3 in the training data.
  • Both models produced an \(R^2\) above 96% and a predicted \(R^2\) above 95% with a 1% advantage on the cross-validation model.
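A minimal sketch of principal component regression on synthetic data; the dialysis dataset and the expected_survival response are not reproduced here, so everything below is an illustrative assumption, not the authors' fitted model:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(56, 8))                         # 56 observations, 8 features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=56)   # synthetic response

# Standardize and compute PC scores
Z = (X - X.mean(axis=0)) / X.std(axis=0)
S = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
scores = Z @ eigenvectors[:, order]

# Regress the response on the first k PC scores (ordinary least squares)
k = 4
A = np.column_stack([np.ones(len(y)), scores[:, :k]])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```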

PCA in Machine Learning

  • Feature Selection: Use PCA to select relevant features.
  • Model Training: Enhance model performance by reducing dimensionality.
  • Preprocessing: Standardize and normalize data before applying PCA.

Conclusion

  • Summary: PCA is an unsupervised learning technique for dimensionality reduction and data visualization.
  • Key Takeaways: Understand eigenvectors, eigenvalues, and explained variance.

Questions

  • Open the floor for questions from the audience.

Thank You

  • Contact Information

References

  • List of sources and references used in the presentation.